Background Information

Smoking is one of the leading causes of preventable deaths globally, contributing significantly to a wide range of health problems, including cardiovascular diseases, respiratory conditions, and various cancers. Over the years, many countries have implemented measures to reduce smoking prevalence, such as public health campaigns, tobacco taxes, smoking bans and regulations on tobacco advertising. Despite these efforts, smoking continues to affect millions of people worldwide, with its prevalence varying significantly across regions and populations. By investigating the prevalence of smoking among adults globally from 2000 to 2020, this study aims to explore and identify patterns or disparities in smoking behaviours. Understanding these trends is crucial for informing future public health strategies and reducing the global burden of tobacco-related diseases.


Research Question

How does the prevalence of smoking in adults across the world vary and what trends emerge from this analysis?


Project Organisation

The project is organised into key sections to help you navigate its contents. The Codebook folder provides detailed documentation on the dataset, including variable descriptions and structure. The Data Folder Contains the raw datasets. The Plots Folder contains Visualisations created during analysis. The Scripts Folder contains the code used for data processing, analysis, and visualisation.


Dataset

Dataset Origins

The raw dataset for this visualisation project comes from Multiple sources compiled by World Bank (2024) – processed by Our World In Data. “Prevalence of current tobacco use (% of adults)” dataset

This data represents the percentage of the population ages 15 years and over who currently use any tobacco product (smoked and/or smokeless tobacco) on a daily or non-daily basis. Tobacco products include cigarettes, pipes, cigars, cigarillos, waterpipes, bidis, kretek, heated tobacco products, and all forms of smokeless (oral and nasal) tobacco. Tobacco products exclude e-cigarettes (which do not contain tobacco), “e-cigars”, “e-hookahs”, JUUL and “e-pipes”. The rates are age-standardized to the WHO Standard Population.

Limitations

When interpreting the project’s results, it is important to consider the limitations of the dataset. Countries with irregular surveys or significant gaps in data contribute to large uncertainties in the estimates, making these results less reliable. Additionally, the dataset excludes information on e-cigarettes and other non-tobacco products, focusing solely on traditional tobacco use. These limitations mean that the findings should be interpreted with caution, particularly for countries with incomplete or inconsistent data, as the trends and prevalence rates presented may not fully capture the complexity of smoking behaviours globally.


Data Preparation

Installing Packages

# Packages to install and load
packages <- c("tidyverse", "ggplot2", "tidyr", "dplyr", "plotly", "rnaturalearth", "rnaturalearthdata", "sf", "htmlwidgets")

# Function to install packages and load them
install_and_load <- function(packages) {
  for (package in packages) {
    if (!require(package, character.only = TRUE)) {
      install.packages(package, dependencies = TRUE)
      library(package, character.only = TRUE)
    } else {
      library(package, character.only = TRUE)
    }
  }
}

# Run function
install_and_load(packages)

Loading Data

# Load raw data / replace "data/smoking.csv" with the path to CVS file. 
rawdata <- read.csv("data/smoking.csv")

Sanity Check

#sanity check 
str(rawdata)           # structure and summary
summary(rawdata)
head(rawdata)          # Check first rows
dim(rawdata)           # Check dimensions
colSums(is.na(rawdata))# Check missing values

# Check for duplicates
duplicates <- rawdata[duplicated(rawdata), ]
print(duplicates)

# Unique values in key columns
unique(rawdata$Entity)
unique(rawdata$Year)

Cleaning Data

The dataset initially included various entities, such as global regions and income levels. To focus on country-level analysis, these entities were excluded. Additionally, the data was streamlined to 5-year intervals for clarity, which involved removing the years 2018 and 2019. For simplicity and readability, the variable representing prevalence was also renamed.

# Cleaning data 
# Remove specific entities
countries_data <- rawdata[!rawdata$Entity %in% c("East Asia and Pacific (WB)", "Sub-Saharan Africa (WB)", 
                                                 "Upper-middle-income countries", "Europe and Central Asia (WB)", 
                                                 "World", "European Union (27)", "Low-income countries", 
                                                 "Lower-middle-income countries", "Middle East and North Africa (WB)", 
                                                 "Middle-income countries", "North America (WB)", "South Asia (WB)", 
                                                 "Latin America and Caribbean (WB)", "High-income countries"), ]

# Exclude specific years
countries_data <- countries_data %>%
  filter(!(Year %in% c(2018, 2019)))

# Rename column for ease
countries_data <- countries_data %>%
  rename(Prevalence = Prevalence.of.current.tobacco.use....of.adults.)

Load and Clean to merge

In this section, the world map data was loaded using the ne_countries() function, which provides a geospatial dataset containing geographic and country-level information. To ensure compatibility with the smoking data, discrepancies in ISO country codes between the world map dataset (world) and the smoking prevalence dataset (countries_data) were identified. Specifically, any codes present in countries_data but missing in world, and vice versa, were highlighted. To address these discrepancies the function fix_iso_codes() was created to correct mismatched ISO codes.

# Load world map data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Identify codes in countries_data but not in world
missing_in_world <- setdiff(countries_data$Code, world$iso_a3)
print(missing_in_world)

# Identify codes in world but not in countries_data
missing_in_data <- setdiff(world$iso_a3, countries_data$Code)
print(missing_in_data)

# Fix missing ISO codes
fix_iso_codes <- function(world_data) {
  world_data %>%
    mutate(iso_a3 = ifelse(name == "France", "FRA", iso_a3)) %>%
    mutate(iso_a3 = ifelse(name == "Norway", "NOR", iso_a3))
}
world <- fix_iso_codes(world)

Merge Data

To prepare the datasets for visualisation, the world map data and the cleaned smoking prevalence data were merged using the left_join() function. The datasets were aligned by matching the ISO country codes (iso_a3 in the world dataset and Code in the smoking dataset). During this process, any missing prevalence values were replaced with 0 using the mutate() function, ensuring that all countries had defined values for visualisation, even if data was unavailable. Duplicate entries were identified and removed to maintain data integrity.

To make the data compatible with interactive visualisations in Plotly, the geometric information in the sf object was removed using the st_set_geometry(NULL) function. A detailed inspection of the merged dataset was performed. Finally, distributions of data across years and countries were tabulated to confirm consistency and readiness for visualisation. This step ensured a clean, structured dataset suitable for creating accurate and informative visualisations.

# Merge datasets, sanity check/clean merged data 
# Merge datasets
map_data <- world %>%
  left_join(countries_data, by = c("iso_a3" = "Code"))

# Replace NA prevalence values with 0
map_data <- map_data %>%
  mutate(Prevalence = ifelse(is.na(Prevalence), 0, Prevalence))

# Remove duplicates
map_data <- map_data[!duplicated(map_data), ]

# Remove geometry for Plotly compatibility
plot_data <- map_data %>%
  st_set_geometry(NULL)

# Structure and summary of cleaned dataset
str(plot_data)
summary(plot_data)

# Check for duplicates country-year pairs
duplicates <- plot_data %>%
  group_by(iso_a3, Year) %>%
  filter(n() > 1)
print(duplicates)

# Check for missing prevalence values
missing_prevalence <- plot_data %>%
  filter(is.na(Prevalence))
print(missing_prevalence)

# Distribution of years and countries
table(plot_data$Year)
table(plot_data$iso_a3)

Description of Data

The cleaned dataset consists of country-level information on smoking prevalence among adults from 2000 to 2020, focusing on 5-year intervals. Key variables include Entity, which represents the name of each country; Code, the unique ISO-3 country code used for data alignment; Year, indicating the specific year of observation; and Prevalence, which provides the percentage of the adult population currently using tobacco products. The data has been cleaned to exclude irrelevant entities, ensuring it accurately reflects country-level trends. Furthermore, missing values in prevalence have been accounted for, and duplicate entries removed, resulting in a comprehensive and structured dataset ideal for visualising smoking trends over time.


Initial Visualisations

First visualisation

The initial visualisation provided a basic representation of smoking prevalence on a static map. While it successfully displayed the prevalence rates for countries, it lacked interactivity and failed to account for changes across years. Users could not explore how smoking prevalence evolved over time or access specific data points for each country. The map presented a single snapshot without the ability to delve into detailed information, such as individual country names or exact prevalence percentages. These limitations highlighted the need for a more dynamic and informative visualisation to enhance user engagement and understanding.

# first visualisation
ggplot(data = map_data) +
  geom_sf(aes(fill = Prevalence), color = "white", size = 0.2) +
  scale_fill_gradient(low = "blue", high = "red", na.value = "grey") +
  labs(title = "Global Smoking Prevalence Heatmap",
       fill = "Prevalence (%)") +
  theme_minimal()


Second visualisation

My next visualisation aimed to be more interactive. This time, I managed to get individual interactive visualisations for each year. However, this did not allow us to visualise the evolution across time in a single plot.

# Function to create an interactive plot for a specific year

# Create a list of subsets by year
yearly_subcategories <- split(plot_data, plot_data$Year)

create_interactive_year_plot <- function(data, year) {
  plot <- ggplot(data = data %>% filter(Year == year)) +
    geom_sf(aes(fill = Prevalence, text = paste("Country:", Entity, "<br>Prevalence:", Prevalence, "%")),
            size = 0.2) +  # Removed black borders
    scale_fill_gradient(low = "lightyellow", high = "darkred", na.value = "white", name = "Prevalence (%)") +
    labs(
      title = paste("Global Smoking Prevalence -", year),
      fill = "Prevalence (%)"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 16),
      legend.position = "bottom",
      axis.title = element_blank(),
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank()
    )
  
  # Convert ggplot to interactive plotly object with hover text
  ggplotly(plot, tooltip = c("text"))
}

# Create interactive visualisations for each year
interactive_plot_2000 <- create_interactive_year_plot(map_data, 2000)
interactive_plot_2005 <- create_interactive_year_plot(map_data, 2005)
interactive_plot_2010 <- create_interactive_year_plot(map_data, 2010)
interactive_plot_2015 <- create_interactive_year_plot(map_data, 2015)
interactive_plot_2020 <- create_interactive_year_plot(map_data, 2020)

second visualisation

2000

2005

2010

2015

2020

Final Visualisation

To create my final visualisation, I combined the previous data sets of the individual years into one single data set.

For the final visualisation, I needed to merge the individual datasets for each year back into a single dataset. Initially, I used the data set countries_data to create my visualisation, however it resulted in unexpected behaviour when combined with the plotly package. Seperating and combining the datasets allowed me to correct the visualisation. Additionally, I added a new colour scheme with adjusted thresholds to better differentiate smoking prevalence rates across countries. This approach improved clarity and enhanced the visual impact of the data. I also incorporated all years into a unified visualisation of all years, which allows us to observe trends over time seamlessly. Finally, I added descriptive titles and annotations to provide context and guide interpretation, making the visualisation more informative.

# Combine all yearly subcategories back into a single dataset
combined_data_year <- bind_rows(yearly_subcategories)

# Enhanced plot
final_plot <- plot_ly(
  data = combined_data_year,
  type = "choropleth",
  locations = ~iso_a3,
  locationmode = "ISO-3",
  z = ~Prevalence,
  frame = ~Year,
  text = ~ifelse(
    is.na(Prevalence),
    NA,  
    paste("Country:", Entity, "<br>Prevalence:", Prevalence, "%")  # Show prevalence for available data
  ),
  hoverinfo = "text",  # Show only custom hover text
  colorscale = list(
    list(0, "#ffeda0"),   # Light yellow for low prevalence
    list(0.2, "#feb24c"), # Light orange
    list(0.4, "#fd8d3c"), # Orange
    list(0.6, "#fc4e2a"), # Reddish-orange
    list(0.8, "#e31a1c"), # Red
    list(1, "#67000d")    # Dark red for high prevalence
  ),
  zmin = 0,
  zmax = 70,  # Adjusted prevalence
  showscale = TRUE,
  marker = list(line = list(color = "grey", width = 0.5))  # Border for countries
) %>%
  layout(
    title = "Global Smoking Prevalence in Adults (2000-2020)",
    geo = list(
      projection = list(type = "equirectangular"),  # Rectangular projection
      showcoastlines = TRUE,
      coastlinecolor = "grey",                     # Coastline color
      showcountries = TRUE,                        # country borders
      countrycolor = "grey",                       # Country border color
      showland = TRUE,                             # Highlight land 
      landcolor = "white",                         # Land color
      showocean = TRUE,                            # Highlight oceans
      oceancolor = "lightblue",                    # Ocean color
      showframe = FALSE                            # Remove map frame
    ),
    annotations = list(
      # Note annotation
      list(
        x = 0.5,                                   # Horizontal 
        y = -0.1,                                  # Vertical  (below the map)
        xref = "paper",                            # x-axis
        yref = "paper",                            # y-axis
        text = "Note: White regions indicate missing data.", # Annotation 
        showarrow = FALSE,                         # No arrow
        font = list(size = 12, color = "black"),   # Font 
        align = "left"
      ),
      # Source annotation
      list(
        x = 0.5,                                   # Horizontal 
        y = -0.15,                                 # Vertical (below the map)
        xref = "paper",                            # x-axis
        yref = "paper",                            # y-axis
        text = "Source: Multiple sources compiled by World Bank (2024) – processed by Our World In Data",
        showarrow = FALSE,                         # No arrow
        font = list(size = 10, color = "black"),   # Font
        align = "center"
      )
    )
  )

# Show  plot
final_plot 

References

Data : Multiple sources compiled by World Bank (2024) – processed by Our World in Data. “Prevalence of current tobacco use (% of adults)” dataset. World Health Organization (via World Bank), “World Development Indicators” [original data]. Retrieved December 9, 2024 from https://ourworldindata.org/grapher/share-of-adults-who-smoke